Supervised Learning - Foundations Project: ReCell

Context

Buying and selling used phones and tablets used to be something that happened on a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market would be worth \$52.7bn by 2023 with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets that offer considerable savings compared with new models.

Refurbished and used devices continue to provide cost-effective alternatives to both consumers and businesses that are looking to save money when purchasing one. There are plenty of other benefits associated with the used device market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide attractive offers to customers for refurbished devices. Maximizing the longevity of devices through second-hand trade also reduces their environmental impact and helps in recycling and reducing waste. The impact of the COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.

Objective

The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.

Data Description

The data contains the different attributes of used/refurbished phones and tablets. The data was collected in the year 2021. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Data Overview

Dataset contains 3454 rows and 15 columns

Most of the datatypes are float with 4 object and 2 integer columns. It appears not all the rows are the same length, so adjusting for missing data will be addressed later

Copy the original data dataframe into a new dataframe ensuring no duplication and change all object columns to category datatype.

Several columns appear to be missing values. Main Camera MP has the most missing values

Exploratory Data Analysis (EDA)

Questions:

  1. What does the distribution of normalized used device prices look like?
  2. What percentage of the used device market is dominated by Android devices?
  3. The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?
  4. A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?
  5. Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?
  6. A lot of devices nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of devices offering greater than 8MP selfie cameras across brands?
  7. Which attributes are highly correlated with the normalized price of a used device?

Normal used price distribution looks roughly normal with some outliers on both sides.

Normalized new price distribution also looks roughly normal with outliers on both sides.

Weight distribution looks roughly normal, but is highly skewed to the right.

Ram does not have a normal distribution.

Days used does not have a normal distribution, but it has no outliers.

Release year is also not normally distributed, but with no outliers.

Battery is not normally distributed with a high right skew to the data.

Internal Memory is also not normally distributed with a heavy right skew. Due to the units it is skewed, it may benefit from a log fitting for the model.

Both Selfie and Main camera megapixels are not normally distributed and skewed to the right with outliers.

What percentage of the used device market is dominated by Android devices?

93% of all devices sold are Android.

The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?

RAM offered ranges from 0.02 - 12 GB across brands. Several Brands offer up to 12 GB RAM (Huawei, Motorola, OnePlus, Oppo, Samsung, and Xiaomi) on set devices. Most brands offer 4 GB of Ram.

A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?

Weight is positively correlated with battery and has a 71% correlation. So generally as weight increases, so does the energy capacity of the battery.

Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?

6 inches = 15.24 cm

A total of 3,454 devices have screens larger than 6 inches across all the brands.

A lot of devices nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of devices offering greater than 8MP selfie cameras across brands?

Above is the boxplot and summary statistics for brand name devices with selfie cameras larger than 8 MP to better understand the distributions of devices with greater than 8 MP selfie cameras.

Which attributes are highly correlated with the normalized price of a used device?

Normalized new price correlates the highest with 83%. The next highest correlation attributes are in the 40-50% correlation, so not particularly high, but worth looking into (main_camera_mp (54%), ram (53%), battery (47%), screen_size (46%)).

Data Preprocessing

Main Camera MP has the most missing values with 179 missing. Weight is next with 7 values missing, followed by battery with 6 missing values, internal memory and ram with 4 values each, finally selfie camera mp has 2 missing values. Let's explore the main camera mp in more detail

Checking to see if there is any correlation between the selfie camera MP and the main camera MP.

42% correlation is quite low, so not best to correlate the main camera and selfie camera MPs.

The median value of 8 MP seems like a good fit for the missing values. The mean value is 9.46 which does not make sense as a MP value for a main camera. Therefore we will fill all missing values with the median of 8.

Median selfie cameras for the Google brand name is 8 MP. Since both missing values are Google brand, we will replace those missing values with 8 MP.

The mean weight for the XOLO brand is right skewed, so replacing missing values with the median weight (126) makes more sense.

Since the os for all missing values is Others, we can see the median value for those is .06. We will replace all missing values with .06.

Since the os for all missing values is Others, we can see the median value for those is .02. We will replace all missing values with .02.

There is a large difference between the mean and median battery sizes. Since the missing values are across brands and os, I think it's best to drop these 7 rows.

No more missing values in the Updated dataframe.

Updated the 4g and 5g columns from 'yes' and 'no' to 1 and 0 for modeling.

Created dummy columns for all categorical columns.

There are outliers in the screen size, main camera mp, selfie camera mp, internal memory, ram, battery, weight, normalized used price, and normalized new price. I think these are valid and not worth dropping.

Internal memory looked to be very skewed, so a log function was applied to create a closer to normal distribution.

No missing values left in dataset. Data is ready for another EDA and modeling.

EDA

We see a very high correlation between os IOS and Apple brand name since they only run the iOS on the Apple phones. In order to remove these highly correlative variables, we will remove the brand_name_Apple column. Next graph zooms into areas where we may have higher correlations to observe.

We see a fairly high correlation with battery and weight with screen size. I do not think we need to adjust any variables, but keep an eye on those when modeling. I will remove the brand_name_Apple column so we do not get interference with the model.

Data looks good to begin building the model.

Building a Linear Regression model

Not all of the correlations are perfectly linear, but a linear fit would capture a good portion of the data.

First pass fit of the model with the data.

Model performance evaluation

Columns that have a VIF over 5 include: screen_size, weight, brand_name_Huawei, brand_name_Others, and brand_name_Samsung. I will systematically drop them to see if multicollinearity exists and if we can improve our model.

Dropping brand_name_Others and screen_size removes the multicollinearity in the dataset. With multicollinearity removed, we can reliably review the p-values in the dataset and remove unneeded predictor variables.

If the p-value is larger than 0.05 (our 95% confidence interval), we can remove it since it is not significant. The following columns satisfy that removal: 5g, battery, days_used, brand_name_Alcatel, brand_name_BlackBerry, brand_name_Celkon, brand_name_Coolpad, brand_name_Gionee, brand_name_Google, brand_name_HTC, brand_name_Honor, brand_name_Huawei, brand_name_Infinix, brand_name_Karbonn, brand_name_LG, brand_name_Lava, brand_name_Meizu, brand_name_Micromax, brand_name_Microsoft, brand_name_Motorola, brand_name_OnePlus, brand_name_Oppo, brand_name_Panasonic, brand_name_Samsung, brand_name_Sony, brand_name_Spice, brand_name_Vivo, brand_name_XOLO, brand_name_ZTE, os_Windows, int_memory_log.

After dropping features with multicollinearity and the statistically insignificant, the model performance has not dropped significantly (.846 to.840)

Checking Linear Regression Assumptions

These assumptions are essential conditions that should be met before we draw inferences regarding the model estimates or use the model to make a prediction.

For Linear Regression, we need to check if the following assumptions hold:-

-Linearity -Independence -Homoscedasticity -Normality of error terms -No strong Multicollinearity

There does not appear to be a significant pattern to the residuals, data points seem randomly distributed.

Since p-value < 0.05, the residuals are not normal as per shapiro test. As an approximation, we can accept this distribution as close to being normal.

Since p-value > 0.05 we can say that the residuals are homoscedastic.

This satisfies all the assumptions of linear regression, time to review the final model.

Final Model Summary

We can see that RMSE on the train and test sets are comparable. So, our model is not suffering from overfitting. MAE indicates that our current model is able to predict normalized used phone price within a mean error of 0.18 units on the test data. Hence, we can conclude the model "ols_res10" is good for prediction as well as inference purposes.

Actionable Insights and Recommendations

-

Our model statistically shows that the most relevant variables when calculating used phone prices are as follows: 4g capabilities, main camera megapixels, selfie camera megapixels, RAM, weight, release year, normalized new price, brand names (Asus, Lenovo, Nokia, Realme, and Xiami), operating systems with Others and Apple.